Joe Clarke & Ahyoung Lim
From cumulative to incident case counts in the PAHO PLISA Health Information Platform for the Americas dengue database
It is important to note that the cumulative datasets are for reports of “all dengue cases: suspected, probable, confirmed, non-severe and severe cases, and deaths. Probable dengue case: person who has a fever or history of fever of 2-7 days duration, two or more symptoms of dengue, and one positive serological test or an epidemiological nexus with a confirmed dengue case within 14 days before onset of symptoms”.
Although this dataset offers “select epidemiological week” slider options for all available epidemiological weeks of every year, the epidemiological week for which information is actually available/reported can differ from that on the slider, and some countries have missing weeks of data. The extent of missing data varies greatly between countries and over time. Moving from cumulative to incident weekly counts is straightforward for countries with 100% reporting of sequential cumulative counts, by subtracting the previous week’s count from the following week’s, but not for countries with missing data. Incident data are already available through the “Chart” and “Epi Curve” options; however, these cover non-severe cases of dengue only. This is not the same as the “total of dengue cases” measured in the cumulative table, which is our priority for this dataset.
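As a minimal base-R illustration with made-up numbers: with complete weekly reporting, incident counts are simply first differences of the cumulative series, but a single missing week leaves the incident counts on both sides of the gap unrecoverable.

```r
# Toy cumulative series for four consecutive weeks (hypothetical numbers)
cum_complete <- c(10, 25, 42, 60)
diff(cum_complete)            # weekly incident counts: 15 17 18

# With week 3 missing, the differences on either side of the gap are lost
cum_gap <- c(10, 25, NA, 60)
diff(cum_gap)                 # 15 NA NA
```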
We want to investigate if this incident dataset can be leveraged to gap-fill the cumulative dataset when moving to incident counts, and what the difference between the two is.
The PAHO PLISA portal only permits downloading the all-countries cumulative dataset one epidemiological week at a time, by moving the epidemiological week slider while selecting all countries. The downloaded files are encoded as UCS-2LE and must be re-encoded and re-saved for further processing.
The files are loaded and a new directory created for processed data.
file_list <- list.files("raw_data/source_data"); file_list
old_dir <- "raw_data/source_data"
new_dir <- "raw_data/formatted_data"
dir.create(new_dir, showWarnings = FALSE)
#Re-encode each file from UCS-2LE and save a copy for further processing
for(j in seq_along(file_list)){
inConn <- file(paste0(old_dir, "/", file_list[j]), encoding = "UCS-2LE")
outConn <- file(paste0(new_dir, "/", file_list[j]))
writeLines(readLines(inConn), outConn)
close(inConn)
close(outConn)
}
csv_files <- list.files(new_dir, full.names=T, recursive=F, pattern="\\.csv$") #file paths for loading
csv_files <- csv_files[-1]; csv_files #drop the first file
temp <- lapply(csv_files, read.table, sep="\t", colClasses = rep("character",16), header=TRUE)
merge <- data.table::rbindlist(temp)
#Column rename and select
all_dt <- merge %>%
select(Country.or.Subregion, Serotype, Year, Confirmed, Deaths,
Epidemiological.Week..a., Severe.Dengue..d., Total.of.Dengue.Cases..b.) %>%
rename(adm_0_name = Country.or.Subregion,
year = Year,
epidemiological_week = Epidemiological.Week..a.,
severe_dengue_cumulative = Severe.Dengue..d.,
dengue_total_cumulative = Total.of.Dengue.Cases..b.) %>%
filter(!epidemiological_week %in% c("-")) # * AL: why does it have some data with epidemiological week being "-" here?
#convert epidemiological week and year to numeric
all_dt$epidemiological_week <- as.numeric(all_dt$epidemiological_week)
all_dt$year <- as.numeric(all_dt$year)
#remove duplicates
all_dt <- all_dt %>%
distinct()%>%
arrange(adm_0_name, year, epidemiological_week)
#add the end date of the cumulative count for each row ("cum_end_date").
for(i in 1:nrow(all_dt)){
all_dt$cum_end_date[i] <- as.character(epiweekToDate(all_dt$year[i], all_dt$epidemiological_week[i])$d1)
} #*AL: maybe try "lubridate::epiweek()" here? it's faster.
#add cum_start_date based on the cum_end_date of the previous row. This is done by lagging cum_end_date because there may be cases where the cumulative count is not reported for consecutive weeks.
all_dt <- all_dt %>%
arrange(adm_0_name, year, epidemiological_week)%>%
group_by(adm_0_name, year)%>%
mutate(cum_start_date = ymd(lag(cum_end_date))+1)
#create "complete" dataset which has a row for every epidemiological week for every year, with NA for the dengue_total count if no data reported for this week. Make sure number of epidemiological weeks in each year is correct according to epidemiological calendars for each year (some have 52 weeks, others have 53).
all_dt <- all_dt %>% ungroup()%>%
tidyr::complete(adm_0_name, year, epidemiological_week)%>%
filter(!(year %in% c(2015:2019, 2021:2022) & epidemiological_week == 53))
all_dt %>%
group_by(adm_0_name) %>% tally()
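To double-check which years require a week 53 row, the epidemiological calendar can be queried directly. This is a sketch assuming lubridate's epiweek(), which follows the US CDC (MMWR) Sunday-start week definition; the late-December window works because the largest epi week observed there is the number of weeks in the year.

```r
library(lubridate)

# For each year, the largest epi week observed in late December is the number
# of epidemiological weeks in that year (52 or 53)
weeks_in_year <- sapply(2014:2022, function(y) {
  late_dec <- seq(ymd(paste0(y, "-12-20")), ymd(paste0(y, "-12-31")), by = "day")
  max(epiweek(late_dec))
})
weeks_in_year  # 2014 and 2020 have 53 weeks under the MMWR calendar
```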
#add cum_end_date again for rows that were missed from the original data but were filled by 'complete' function above. Add the start/end date of epidemiological week for all rows("calendar_start_date" and "calendar_end_date").
for(i in 1:nrow(all_dt)){
all_dt$calendar_end_date[i] <- as.character(epiweekToDate(all_dt$year[i], all_dt$epidemiological_week[i])$d1)
if (!is.na(all_dt$cum_end_date[i])) next
all_dt$cum_end_date[i] <- as.character(epiweekToDate(all_dt$year[i], all_dt$epidemiological_week[i])$d1)
}
# add cum_start_date, calendar_start_date and source_id
all_dt <- all_dt %>%
mutate(cum_end_date = ymd(cum_end_date))%>%
mutate(cum_start_date = ifelse(is.na(cum_start_date), as.character(ymd(cum_end_date)-6), paste0(cum_start_date)))%>%
mutate(cum_interval = interval(cum_start_date, cum_end_date) %/% weeks(1)+1)%>%
mutate(calendar_end_date = ymd(calendar_end_date))%>%
mutate(calendar_start_date = ymd(calendar_end_date)-6)%>%
mutate(source_id= ifelse(is.na(dengue_total_cumulative)==FALSE, paste0("all_PAHO_countries_PAHO_2014_to_2022_EW",epidemiological_week), NA))
#summary(all_dt$cum_interval)
#administrative unit codes
all_dt$adm_0_name <- stri_trans_general(all_dt$adm_0_name,"Latin-ASCII")
gaul_codes_world <- read.csv(file="ref_data/gaul_codes_world.csv")
gaul_codes_world <- gaul_codes_world %>%
select(ADM0_NAME, ADM0_CODE) %>%
rename(adm_0_name = ADM0_NAME)%>%
rename(adm_0_code = ADM0_CODE)%>%
distinct()
#plyr::count(all_dt$adm_0_name) #52 countries
#plyr::count(gaul_codes_world$adm_0_name) #273 countries
ctry_check <- merge(all_dt, gaul_codes_world, by="adm_0_name", all.x=T)
#check the country names that did not match with gaul names
ctry_check %>% group_by(adm_0_name, adm_0_code)%>% tally() %>% filter(is.na(adm_0_code))
ctry_lookup <- c(
# gaul name to paho name
"Turks and Caicos islands" = "Turks and Caicos Islands",
"British Virgin Islands" = "Virgin Islands (UK)",
"United States Virgin Islands" = "Virgin Islands (US)"
)
# PAHO countries without gaul codes
#"Bonaire, Saint Eustatius and Saba"
#"Curacao"
#"Saint Barthelemy",
#"Saint Martin", * note: Saint Martin (the French part of the island) and Sint Maarten (the Dutch part) are distinct territories
#"Sint Maarten",
gaul_codes_world <- gaul_codes_world %>%
mutate(adm_0_name = recode(adm_0_name, !!!ctry_lookup))
all_dt <- merge(all_dt, gaul_codes_world, by="adm_0_name", all.x=T)
Each data set is then processed into two data hierarchies, primary (A) and secondary (B), with calendar dates added. Epidemiological week and year remain, as these are useful for processing later on.
The case counts are left as cumulative counts at this stage, which may be useful for users who prefer to work with cumulative counts. Here, the calendar start date is fixed at the beginning of epidemiological week one of each year, and the calendar end date is moved forward to cover the period of the cumulative count.
#select columns for master table_A and change to date format
all_dt_A <- all_dt %>%
select(adm_0_name, adm_0_code, year, epidemiological_week,
calendar_start_date, calendar_end_date,
cum_start_date, cum_end_date, cum_interval,
dengue_total_cumulative, source_id
)%>%
mutate(calendar_start_date = ymd(calendar_start_date),
calendar_end_date = ymd(calendar_end_date),
cum_start_date = ymd(cum_start_date),
cum_end_date = ymd(cum_end_date))
all_dt_A_tidy <- all_dt_A %>%
distinct() %>%
mutate_at(c(2), ~replace(., is.na(.), 0)) %>%
group_by(adm_0_name, year) %>%
arrange(adm_0_name, year, epidemiological_week)
#Check admin codes and fill gaps
admin_NA <- all_dt_A_tidy %>%
dplyr::filter(adm_0_code==0)
# *AL: please explain this part - why adm_0_code assigned for Virgin islands only?
#all_dt_A_tidy$adm_0_code[all_dt_A_tidy$adm_0_name=="Virgin Islands (UK)"]<-"39"
#all_dt_A_tidy$adm_0_code[all_dt_A_tidy$adm_0_name=="Virgin Islands (US)"]<-"258"
write.csv(all_dt_A_tidy,"processed_data/all_dt_A.csv", row.names = FALSE)
#Dataset B (secondary dataset), which contains information on disease severity, serotype, and mortality
all_dt_B <- all_dt %>%
select(adm_0_name, adm_0_code,
calendar_start_date, calendar_end_date,
dengue_total_cumulative, Deaths, Serotype, severe_dengue_cumulative, source_id)%>%
rename(dengue_total=dengue_total_cumulative,
severe_dengue=severe_dengue_cumulative)%>%
mutate(calendar_start_date = ymd(calendar_start_date),
calendar_end_date = ymd(calendar_end_date))%>%
filter(dengue_total!= "NA")
all_dt_B_tidy <- all_dt_B %>%
distinct() %>%
mutate_at(c(2), ~replace(., is.na(.), 0))%>%
group_by(adm_0_name) %>%
arrange(desc(calendar_start_date))
admin_NA <- all_dt_B_tidy %>%
dplyr::filter(adm_0_code==0)
#all_dt_B_tidy$adm_0_code[all_dt_B_tidy$adm_0_name=="Virgin Islands (UK)"]<-"39"
#all_dt_B_tidy$adm_0_code[all_dt_B_tidy$adm_0_name=="Virgin Islands (US)"]<-"258"
all_dt_B_tidy<-all_dt_B_tidy %>%
pivot_longer(c(severe_dengue, dengue_total, Deaths),
names_to="dengue_classification", values_to="total")
all_dt_B_tidy <- all_dt_B_tidy[!(all_dt_B_tidy$dengue_classification == ""), ]
all_dt_B_tidy <- all_dt_B_tidy %>%
select(adm_0_name, adm_0_code, calendar_start_date, calendar_end_date,
dengue_classification, total, Serotype, source_id)
write.csv(all_dt_B_tidy,"processed_data/all_dt_B.csv", row.names = FALSE)
Since some cumulative counts may be revised down over time, we need to identify the rows in our dataset that contain such values. To do this, we use a for loop. For instance, if the cumulative count for week 5 was initially 100 but revised down to 90 in week 6, we tag week 5 as containing a revised-down value. The cumulative count for week 5 is compared with all subsequent rows (e.g., week 6 to week 53).
data$result <- FALSE
#split the dataset by country and year (16 countries * 9 years = 144 datasets) and save them into a list
dt_list <- data %>%
group_by(adm_0_name, year)%>%
group_split()
# iterate over each dataset
for (x in 1:length(dt_list)){
df <- dt_list[[x]]
if (nrow(df) < 2) next
# iterate over each row
for (i in 1:(nrow(df) - 1)) {
# if ith row contains NA, skip this row and go to next row
if (is.na(df$dengue_total_cumulative[i])) next
# iterate over each subsequent row
for (j in (i + 1):nrow(df)) {
# skip the row if it's NA and compare with the next row instead
if (is.na(df$dengue_total_cumulative[j])) next
# compare the ith and jth rows; if the ith cumulative count is larger than the jth, set "result" to TRUE for row i
if (df$dengue_total_cumulative[i] > df$dengue_total_cumulative[j]) {
dt_list[[x]]$result[i] <- TRUE
# testing for loop
# print(paste0("I compared ", i, "th row and ", j, " row of dt_list ", x))
}
}
}
# the last row is not compared, so set its value to NA
#dt_list[[x]]$result[nrow(df)] <-NA
}
view(dt_list[[1]])
# merging datasets back
data2 <- rbindlist(dt_list)
Where the total sum of weekly counts equals the annual count (assuming the annual count data are correct), NA values were identified and replaced with count values in the following three cases, each of which reduces the number of NAs.
summary(cum_dt$dengue_total_cumulative2) # NA=8679 (decreased from 8978)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 4 440 35908 6259 2363490 8679
summary(cum_dt2$dengue_total_cumulative2) # NA=7648 (decreased from 8679)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 6 375 33751 5501 2363490 7648
summary(cum_dt3$dengue_total_cumulative3) #NA 7040 (decreased from 7648)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 6 314 32583 5086 2363490 7040
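The exact filling rules are not reproduced in this excerpt, but one such case can be sketched as follows. fill_flat_gaps() is a hypothetical helper, not part of the original pipeline: if the cumulative count is identical immediately before and after a run of NAs, no new cases occurred in between, so those NAs can be filled with the same value.

```r
library(zoo)

# Hypothetical sketch: fill NAs only where the cumulative series is flat
# across the gap (same observed value before and after the run of NAs),
# implying zero incident cases in the missing weeks
fill_flat_gaps <- function(x) {
  prev <- na.locf(x, na.rm = FALSE)                   # last observed value
  nxt  <- na.locf(x, fromLast = TRUE, na.rm = FALSE)  # next observed value
  ifelse(is.na(x) & !is.na(prev) & prev == nxt, prev, x)
}

fill_flat_gaps(c(10, NA, NA, 10, NA, 15))  # 10 10 10 10 NA 15
```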
We first attempted interpolation using the na.approx() function from the zoo package. Different imputation approaches will be explored later on.
#create an epidemiological week 0 row (cumulative count 0) for each country-year, so that early weeks with no reported cumulative count can be interpolated.
data3 <- cum_dt3 %>%
as_tibble() %>%
group_by(adm_0_name) %>%
group_modify(~ add_row(.x,year=2014:2022,
epidemiological_week = 0,
dengue_total_cumulative3 = 0 ,.before = 0))%>%
arrange(adm_0_name, year, epidemiological_week)
data3 <- data3 %>%
group_by(adm_0_name, year)%>%
mutate(dengue_na_approx = as.integer(na.approx(dengue_total_cumulative3, na.rm=FALSE, rule = 2, maxgap=2)))%>%
mutate(dengue_na_spline = as.integer(na.spline(dengue_total_cumulative3, na.rm=FALSE, maxgap=2, method="hyman")))%>%
ungroup()
# maxgap=2 fills gaps of at most two consecutive missing values. rule=2 (passed to approx) extends the nearest observed value to leading/trailing NAs instead of returning NA.
summary(data3$dengue_total_cumulative3)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 4 256 31730 4792 2363490 7040
summary(data3$dengue_na_approx)# NA 6823 (decreased from 7040)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 4 270 31436 4825 2363490 6823
# Extract runs of consecutive missing values
consec <- data.frame(unclass(rle(is.na(data3$dengue_total_cumulative3))))
# Note that rle() operates on the full column as arranged (by country, year, and week), not per group. The 'lengths' column gives the number of consecutive rows and 'values' indicates whether the cumulative counts in that run are NA (values==TRUE) or not.
# Highlight those year & epidemiological weeks where there are >= 6 imputed values consecutively, so that these can be discarded moving forward (as we have less confidence in imputed values over periods of longer than 6 weeks)
consec <- consec %>%
mutate(values2 = ifelse(lengths < 6 & values == TRUE, FALSE, values))
consec <- as.data.frame(lapply(consec, rep, consec$lengths))
data3 <- cbind(data3, consec)
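Because rle() runs over the full column, an NA run could in principle spill across two adjacent country-year groups. A grouped variant avoids this; the sketch below (not part of the original pipeline) demonstrates it on toy data:

```r
library(dplyr)

# Toy data: two country-year groups whose NA runs touch at the boundary
toy <- tibble(
  adm_0_name = c("A", "A", "A", "B", "B"),
  year = 2020,
  dengue_total_cumulative3 = c(1, NA, NA, NA, 5)
)

# Compute NA run lengths within each group so a run cannot span a boundary
toy <- toy %>%
  group_by(adm_0_name, year) %>%
  mutate(na_run_length = with(rle(is.na(dengue_total_cumulative3)),
                              rep(lengths, lengths))) %>%
  ungroup()
toy$na_run_length  # 1 2 2 1 1 -- the NA run does not spill from A into B
```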
Compare the original raw data, the revised data, and the imputed data. The black dotted line indicates original raw values that were excluded because they were later revised down, the red line indicates the revised and gap-filled (but not yet imputed) data, and the blue line shows the imputed data.
Prepare the dataset for converting cumulative to incident counts. This is most straightforward where we have consecutive epidemiological weeks, as a lagged column can be created and subtracted from each week's cumulative count.
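With consecutive weeks, the conversion can be sketched with dplyr::lag() on toy data (hypothetical values; the dengue_total_lag and dengue_total_inc column names follow those in the final dataset):

```r
library(dplyr)

# Toy cumulative series for one country-year with consecutive weeks
toy <- tibble(
  adm_0_name = "Panama", year = 2020,
  epidemiological_week = 1:4,
  dengue_total_cumulative3 = c(12, 30, 30, 47)
)

# Incident count = this week's cumulative count minus last week's
toy <- toy %>%
  group_by(adm_0_name, year) %>%
  arrange(epidemiological_week, .by_group = TRUE) %>%
  mutate(dengue_total_lag = lag(dengue_total_cumulative3),
         dengue_total_inc = dengue_total_cumulative3 - dengue_total_lag) %>%
  ungroup()
toy$dengue_total_inc  # NA 18 0 17 (week 1 has no previous week to subtract)
```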
## # A tibble: 7 × 3
## # Groups: source_cat [4]
## source_cat source_cat2 n
## <chr> <chr> <int>
## 1 Imputed Imputed 229
## 2 NA Revised-NA 151
## 3 NA <NA> 6672
## 4 Raw Raw 15462
## 5 Zero Zero1 287
## 6 Zero Zero2 1031
## 7 Zero Zero3 608
Of 24,440 rows, 151 rows have been replaced with NAs following the ‘Revised down’ stage. In total, 2,155 gaps have been filled by the ‘Zero filling’ and ‘Imputation’ stages.
The following is an example comparing different imputation methods for Panama and Argentina. The na.spline() function gives better imputation results than na.approx() when converted to incident counts, so only na.spline() will be included in our final dataset.
## [1] "adm_0_name" "adm_0_code"
## [3] "year" "epidemiological_week"
## [5] "calendar_start_date" "calendar_end_date"
## [7] "cum_start_date" "cum_end_date"
## [9] "cum_interval" "dengue_total_cumulative"
## [11] "dengue_total_cumulative3" "dengue_na_approx2"
## [13] "dengue_na_spline2" "result"
## [15] "source_id" "source_id2"
## [17] "source_cat" "source_cat2"
## [19] "dengue_total_lag" "dengue_total_inc"
## [21] "dengue_na_approx_lag" "dengue_na_approx_inc"
## [23] "dengue_na_spline_lag" "dengue_na_spline_inc"
Let’s see how the data coverage has been updated so far. The figure below shows the weekly data coverage (2014-2022) by source (raw, imputed, revised, not available) and country. Imputation filled in gaps, whereas revision may have created more. NA refers to cases where the original raw data were not available; Revised-NA refers to cases where the original raw data were available but have been replaced with NAs because they were revised down over time.
So, in summary, what changes have been made to the original raw datasets?